%pylab inline
Populating the interactive namespace from numpy and matplotlib
Common data formats:
# Using FRED requires a personal key
# To avoid exposing my key, I have it in a separate file
from keys import fred_key
series_id = 'GNPCA'
request_url = 'http://api.stlouisfed.org/fred/series?series_id=' + series_id + '&api_key=' + fred_key + '&file_type=json'
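As a side note (a sketch, not part of the original request): urllib.urlencode builds the query string and escapes parameter values automatically, which is safer than string concatenation:
import urllib
params = urllib.urlencode({'series_id': series_id, 'api_key': fred_key, 'file_type': 'json'})
request_url = 'http://api.stlouisfed.org/fred/series?' + params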
import urllib2
f = urllib2.urlopen(request_url)
data = f.read()
data
'{"realtime_start":"2016-04-07","realtime_end":"2016-04-07","seriess":[{"id":"GNPCA","realtime_start":"2016-04-07","realtime_end":"2016-04-07","title":"Real Gross National Product","observation_start":"1929-01-01","observation_end":"2015-01-01","frequency":"Annual","frequency_short":"A","units":"Billions of Chained 2009 Dollars","units_short":"Bil. of Chn. 2009 $","seasonal_adjustment":"Not Seasonally Adjusted","seasonal_adjustment_short":"NSA","last_updated":"2016-03-25 07:56:03-05","popularity":35,"notes":"BEA Account Code: A001RX1"}]}'
import json
json_data = json.loads(data)
type(json_data)
dict
json_data.keys()
[u'seriess', u'realtime_start', u'realtime_end']
json_data[u'realtime_start'], json_data[u'realtime_end']
(u'2016-04-07', u'2016-04-07')
json_data[u'seriess']
[{u'frequency': u'Annual', u'frequency_short': u'A', u'id': u'GNPCA', u'last_updated': u'2016-03-25 07:56:03-05', u'notes': u'BEA Account Code: A001RX1', u'observation_end': u'2015-01-01', u'observation_start': u'1929-01-01', u'popularity': 35, u'realtime_end': u'2016-04-07', u'realtime_start': u'2016-04-07', u'seasonal_adjustment': u'Not Seasonally Adjusted', u'seasonal_adjustment_short': u'NSA', u'title': u'Real Gross National Product', u'units': u'Billions of Chained 2009 Dollars', u'units_short': u'Bil. of Chn. 2009 $'}]
Other series: https://research.stlouisfed.org/fred2/tags/series
API docs:
https://research.stlouisfed.org/docs/api/fred/
Now let's get some data:
request_url = 'http://api.stlouisfed.org/fred/series/observations?series_id=' + series_id + '&api_key=' + fred_key + '&file_type=json'
import urllib2
f = urllib2.urlopen(request_url)
data = f.read()
json_data = json.loads(data)
json_data.keys()
[u'count', u'order_by', u'observation_start', u'file_type', u'observation_end', u'realtime_end', u'sort_order', u'limit', u'observations', u'offset', u'units', u'output_type', u'realtime_start']
json_data[u'count'], json_data[u'order_by'], json_data[u'observation_start'], json_data[u'file_type'], json_data[u'observation_end']
(87, u'observation_date', u'1776-07-04', u'json', u'9999-12-31')
json_data[u'realtime_start'], json_data[u'realtime_end'], json_data[u'sort_order'], json_data[u'limit'], json_data[u'offset']
(u'2016-04-07', u'2016-04-07', u'asc', 100000, 0)
json_data[u'output_type'], json_data[u'units']
(1, u'lin')
values = []
for o in json_data['observations']:
    # note: FRED encodes missing observations as '.', which would make float() raise here
    values.append(float(o['value']))
We should use something like the pandas module to ensure the data is consistent (e.g. that the time series is equally spaced and ordered, and that there are no missing time values); see the pandas sketch below the plot.
plot(values)
[<matplotlib.lines.Line2D at 0x7f72f0036ad0>]
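A minimal sketch of that idea, assuming pandas is installed ('date' and 'value' are the field names in the FRED observations JSON):
import pandas as pd
df = pd.DataFrame(json_data['observations'])
df['date'] = pd.to_datetime(df['date'])
df['value'] = pd.to_numeric(df['value'], errors='coerce')  # FRED's '.' (missing) becomes NaN
ts = df.set_index('date')['value'].sort_index()
ts.plot()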
f = open('../shared/gnpca.txt', 'w')
for v in values:
    f.write(str(v) + '\n')
f.close()
Other sets:
series_id = 'UNRATE'
request_url = 'http://api.stlouisfed.org/fred/series/observations?series_id=' + series_id + '&api_key=' + fred_key + '&file_type=json'
f = urllib2.urlopen(request_url)
data = f.read()
json_data = json.loads(data)
values = []
for o in json_data['observations']:
    values.append(float(o['value']))
plot(values)
[<matplotlib.lines.Line2D at 0x7f72e9dabc10>]
f = open('../shared/unrate.txt', 'w')
for v in values:
    f.write(str(v) + '\n')
f.close()
series_id = 'GS10'
request_url = 'http://api.stlouisfed.org/fred/series/observations?series_id=' + series_id + '&api_key=' + fred_key + '&file_type=json'
f = urllib2.urlopen(request_url)
data = f.read()
json_data = json.loads(data)
values = []
for o in json_data['observations']:
    values.append(float(o['value']))
plot(values)
[<matplotlib.lines.Line2D at 0x7f72e9ce65d0>]
f = open('../shared/gs10.txt', 'w')
for v in values:
    f.write(str(v) + '\n')
f.close()
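Incidentally, since %pylab pulls in numpy, each of these write loops can be replaced with a single savetxt call (a sketch; savetxt's default format is '%.18e', so the text differs slightly from str(v)):
savetxt('../shared/gs10.txt', values)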
request_url = 'http://ichart.finance.yahoo.com/table.csv?s=^GSPC&ignore=.csv'
f = urllib2.urlopen(request_url)
data = f.read()
import csv
parsed_csv = csv.reader(data.split('\n'))
type(parsed_csv)
_csv.reader
parsed_csv.next()
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
parsed_csv.next()
['2016-04-06', '2045.560059', '2067.330078', '2043.089966', '2066.659912', '3750800000', '2066.659912']
parsed_csv.next()
['2016-04-05', '2062.50', '2062.50', '2042.560059', '2045.170044', '4154920000', '2045.170044']
highs = []
for row in parsed_csv:
    if len(row) > 0:
        highs.append(float(row[2]))
plot(highs)
[<matplotlib.lines.Line2D at 0x7f72f0a464d0>]
plot(highs[::-1])
[<matplotlib.lines.Line2D at 0x7f72f0bf08d0>]
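Indexing columns by position (row[2]) breaks if the column order ever changes. A sketch of the same extraction with csv.DictReader, which uses the header row to give access by column name:
parsed_csv = csv.DictReader(data.split('\n'))
highs = [float(row['High']) for row in parsed_csv]  # DictReader skips blank rows
plot(highs[::-1])  # reverse so time runs left to right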
A long list of datasets: http://www.models.life.ku.dk/datasets
import urllib
urllib.urlretrieve('http://www.eigenvector.com/data/Corn/corn.mat', 'corn.mat')
('corn.mat', <httplib.HTTPMessage instance at 0x7f72f30060e0>)
from scipy.io import loadmat
corn = loadmat('corn.mat')
type(corn)
dict
corn.keys()
['mp6nbs', 'information', 'propvals', '__globals__', 'm5nbs', 'mp6spec', 'm5spec', 'mp5spec', '__header__', '__version__', 'mp5nbs']
corn['mp6spec'][0][0][7]
array([[-0.0227014, -0.0228025, -0.0228795, ..., 0.675079 , 0.674679 , 0.674056 ], [-0.0219211, -0.0220554, -0.0221607, ..., 0.682942 , 0.682648 , 0.682164 ], [-0.0208596, -0.0209931, -0.0211072, ..., 0.652276 , 0.651984 , 0.651517 ], ..., [-0.0178645, -0.0179813, -0.0180715, ..., 0.695484 , 0.695075 , 0.694381 ], [-0.0067957, -0.0068881, -0.0069559, ..., 0.690173 , 0.689855 , 0.689125 ], [-0.0152611, -0.0153799, -0.0154608, ..., 0.703188 , 0.70277 , 0.702071 ]])
plot(corn['mp6spec'][0][0][7][0,:])
[<matplotlib.lines.Line2D at 0x7f72f08f7190>]
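The chain of indices above is opaque: loadmat returns each MATLAB struct as a 1x1 structured array, so corn['mp6spec'][0][0] is the struct itself and [7] selects its eighth field by position. The field names can be looked up instead (the names depend on the file, so inspect them first):
mp6 = corn['mp6spec'][0][0]
mp6.dtype.names  # the struct's field names, in order
mp6[mp6.dtype.names[7]]  # same array as corn['mp6spec'][0][0][7]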
Various ways to open Excel sheets in Python:
http://stackoverflow.com/questions/3239207/how-can-i-open-an-excel-file-in-python
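For instance, the pandas route from that thread (a sketch; 'data.xls' is a hypothetical file name, and reading .xls files requires the xlrd package):
import pandas as pd
df = pd.read_excel('data.xls')  # reads the first sheet by default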
Data can be encoded "raw" in a file, i.e. the file just contains bytes that are copied directly into memory. The user must know the format and data type of the data.
fib = [1,1,2,3,5,8,13]
import struct
enc = struct.pack('i', 10)
enc
'\n\x00\x00\x00'
hex(10)
'0xa'
chr(10)
'\n'
len(enc)
4
ord('\n'), ord('a')
(10, 97)
'i'*len(fib)
'iiiiiii'
bytes = struct.pack('i'*len(fib), *fib)
bytes
'\x01\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x05\x00\x00\x00\x08\x00\x00\x00\r\x00\x00\x00'
fib = struct.unpack('iiiiiii', bytes)
fib
(1, 1, 2, 3, 5, 8, 13)
fib_f = struct.unpack('fffffff', bytes)
fib_f
(1.401298464324817e-45, 1.401298464324817e-45, 2.802596928649634e-45, 4.203895392974451e-45, 7.006492321624085e-45, 1.1210387714598537e-44, 1.8216880036222622e-44)
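The integer bit patterns, reinterpreted as 32-bit floats, come out as meaningless denormal values; this is why the reader must know the intended data type. Packing as floats in the first place round-trips correctly: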
bytes = struct.pack('f'*len(fib), *fib)
bytes
'\x00\x00\x80?\x00\x00\x80?\x00\x00\x00@\x00\x00@@\x00\x00\xa0@\x00\x00\x00A\x00\x00PA'
fib_f = struct.unpack('fffffff', bytes)
fib_f
(1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0)
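The same idea applies to whole files. A sketch using numpy's headerless file I/O (array and fromfile are already in the namespace via %pylab; 'fib.raw' is a hypothetical file name):
a = array(fib, dtype='float32')
a.tofile('fib.raw')  # writes just the raw bytes, no header or metadata
fromfile('fib.raw', dtype='float32')  # the reader must supply the matching dtype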
[Earthquake XML.ipynb](Earthquake XML.ipynb)
Common format for "big data". Specifically designed for large multi-dimensional data sets.
Several options depending on need:
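Assuming the format meant here is HDF5 (which matches this description), the usual Python options are h5py and PyTables. A minimal h5py sketch, reusing the corn spectra from above and a hypothetical file name:
import h5py
hf = h5py.File('corn.h5', 'w')
hf.create_dataset('spectra', data=corn['mp6spec'][0][0][7])
hf.close()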
By: Andrés Cabrera mantaraya36@gmail.com
For Course MAT 240F at UCSB
This ipython notebook is licensed under the CC-BY-NC-SA license: http://creativecommons.org/licenses/by-nc-sa/4.0/